Introduction

In this project, we analyze a dataset of National Basketball Association (NBA) player statistics sourced from NBA.com and basketball-reference.com (see links at the bottom of the page). Our aim is to build a model that can effectively predict a player’s net +/- rating, a metric indicative of a player’s impact on the court.

The dataset contains various details of NBA players’ performances, such as their points scored (PTS), field goal percentage (FG.%), three-point percentage (X3P.%), free throw attempts (FTA), offensive rebounds (OREB), defensive rebounds (DREB), assists (AST), turnovers (TOV), steals (STL), blocks (BLK), personal fouls (PF), and player position.

In basketball, players need to excel in several areas to contribute positively to their team’s performance. Understanding the combination of statistics that best predicts a player’s impact can provide valuable insights for team management in player evaluation, strategic decision-making, and overall team composition.

We will attempt to create a model that can predict a player’s net rating based on their individual performance metrics. We will use multiple linear regression modeling techniques and check LINE assumptions. We will address potential challenges such as multicollinearity and heteroscedasticity, if present. We will also evaluate our model’s performance using metrics such as the Adjusted R-squared value and conduct diagnostics to identify potential outliers and influential points.

Methods

# Read in the raw player, team, and position data
player_data = read.csv("nbaplayerstats.csv") 
team_data = read.csv("nbateamstats.csv")
position_data = read.csv("nba_position_data.csv")

library(dplyr)
position_data <- position_data %>%
  group_by(First_Name, Last_Name) %>%
  summarise(Position = names(which.max(table(Position))), .groups = "drop")
player_data <- merge(player_data, position_data, by = c("First_Name", "Last_Name"), all.x = TRUE)
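The names(which.max(table(Position))) idiom above keeps each player’s most frequently listed position. A quick toy check of how it behaves (made-up values):

```r
# table() counts each listed position, which.max() finds the most frequent,
# and names() returns that position's label
pos <- c("PF", "PF", "C", "PF", "C")
names(which.max(table(pos)))  # "PF"
```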

head(player_data)
##   First_Name Last_Name TEAM AGE GP  W  L    MIN  PTS FGM FGA  FG. X3PM X3PA
## 1       A.J.    Lawson  DAL  22 15  5 10  108.3   56  22  44 50.0   10   25
## 2      Aaron    Gordon  DEN  25 50 29 21 1383.8  618 231 499 46.3   59  176
## 3      Aaron    Gordon  DEN  27 68 45 23 2055.1 1109 429 761 56.4   60  173
## 4      Aaron    Gordon  DEN  26 75 46 29 2375.4 1126 434 834 52.0   87  260
## 5      Aaron    Gordon  ORL  24 62 30 32 2017.1  894 335 767 43.7   73  237
## 6      Aaron     Henry  PHI  22  6  6  0   17.0    2   1   5 20.0    0    1
##   X3P. FTM FTA  FT. OREB DREB REB AST TOV STL BLK  PF   FP DD2 TD3 rating Year
## 1 40.0   2   8 25.0    6   15  21   2   3   2   0  11   87   0   0    -46 2023
## 2 33.5  97 149 65.1   77  207 284 161  97  33  34  89 1304   3   1     60 2021
## 3 34.7 191 314 60.8  164  282 446 203  98  54  51 129 2166  11   0    518 2023
## 4 33.5 171 230 74.3  125  314 439 188 133  44  44 148 2066   6   0    321 2022
## 5 30.8 151 224 67.4  107  368 475 228 100  51  39 125 1976  20   1    -68 2020
## 6  0.0   0   0  0.0    0    1   1   0   2   0   2   2    7   0   0    -20 2022
##   Position
## 1       SG
## 2       PF
## 3       PF
## 4       PF
## 5       PF
## 6       SF
summary(player_data)
##   First_Name         Last_Name             TEAM                AGE       
##  Length:2213        Length:2213        Length:2213        Min.   :19.00  
##  Class :character   Class :character   Class :character   1st Qu.:23.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :25.00  
##                                                           Mean   :26.08  
##                                                           3rd Qu.:29.00  
##                                                           Max.   :43.00  
##        GP           W               L              MIN              PTS        
##  Min.   : 1   Min.   : 0.00   Min.   : 0.00   Min.   :   0.6   Min.   :   0.0  
##  1st Qu.:24   1st Qu.:10.00   1st Qu.:11.00   1st Qu.: 267.3   1st Qu.:  92.0  
##  Median :49   Median :21.00   Median :22.00   Median : 924.2   Median : 342.0  
##  Mean   :44   Mean   :22.08   Mean   :21.92   Mean   :1004.1   Mean   : 466.9  
##  3rd Qu.:64   3rd Qu.:34.00   3rd Qu.:31.00   3rd Qu.:1653.8   3rd Qu.: 704.0  
##  Max.   :83   Max.   :64.00   Max.   :60.00   Max.   :2963.2   Max.   :2335.0  
##       FGM             FGA              FG.              X3PM       
##  Min.   :  0.0   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 35.0   1st Qu.:  79.0   1st Qu.: 40.20   1st Qu.:  4.00  
##  Median :126.0   Median : 272.0   Median : 44.50   Median : 31.00  
##  Mean   :171.2   Mean   : 367.4   Mean   : 44.63   Mean   : 51.62  
##  3rd Qu.:261.0   3rd Qu.: 564.0   3rd Qu.: 49.80   3rd Qu.: 82.00  
##  Max.   :774.0   Max.   :1564.0   Max.   :100.00   Max.   :337.00  
##       X3PA            X3P.             FTM              FTA        
##  Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 15.0   1st Qu.: 25.40   1st Qu.: 11.00   1st Qu.: 15.00  
##  Median : 92.0   Median : 33.30   Median : 41.00   Median : 56.00  
##  Mean   :143.6   Mean   : 29.86   Mean   : 72.92   Mean   : 93.87  
##  3rd Qu.:234.0   3rd Qu.: 37.90   3rd Qu.: 98.00   3rd Qu.:129.00  
##  Max.   :843.0   Max.   :100.00   Max.   :692.00   Max.   :803.00  
##       FT.              OREB             DREB            REB        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 65.90   1st Qu.:  9.00   1st Qu.: 34.0   1st Qu.:  46.0  
##  Median : 76.10   Median : 27.00   Median :112.0   Median : 143.0  
##  Mean   : 70.69   Mean   : 42.32   Mean   :141.5   Mean   : 183.8  
##  3rd Qu.: 83.70   3rd Qu.: 58.00   3rd Qu.:210.0   3rd Qu.: 271.0  
##  Max.   :100.00   Max.   :349.00   Max.   :813.0   Max.   :1019.0  
##       AST             TOV              STL             BLK        
##  Min.   :  0.0   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.: 17.0   1st Qu.: 12.00   1st Qu.:  7.0   1st Qu.:  4.00  
##  Median : 63.0   Median : 41.00   Median : 26.0   Median : 12.00  
##  Mean   :103.1   Mean   : 55.67   Mean   : 31.3   Mean   : 19.86  
##  3rd Qu.:137.0   3rd Qu.: 80.00   3rd Qu.: 48.0   3rd Qu.: 26.00  
##  Max.   :763.0   Max.   :312.00   Max.   :138.0   Max.   :196.00  
##        PF               FP              DD2              TD3         
##  Min.   :  0.00   Min.   :  -2.0   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 26.00   1st Qu.: 215.0   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 78.00   Median : 762.0   Median : 0.000   Median : 0.0000  
##  Mean   : 82.75   Mean   : 939.9   Mean   : 3.646   Mean   : 0.2219  
##  3rd Qu.:126.00   3rd Qu.:1449.0   3rd Qu.: 3.000   3rd Qu.: 0.0000  
##  Max.   :286.00   Max.   :4338.0   Max.   :66.000   Max.   :38.0000  
##      rating          Year        Position        
##  Min.   :-642   Min.   :2020   Length:2213       
##  1st Qu.: -66   1st Qu.:2021   Class :character  
##  Median :  -8   Median :2022   Mode  :character  
##  Mean   :   0   Mean   :2022                     
##  3rd Qu.:  51   3rd Qu.:2022                     
##  Max.   : 728   Max.   :2023
na_positions <- sum(is.na(player_data$Position))

print(paste("Number of players without position: ", na_positions))
## [1] "Number of players without position:  0"
# Subset the data for players without position
#missing_positions <- player_data[is.na(player_data$Position),]

#print(missing_positions)

Now we have the data combined with all of our position data.

colnames(player_data)
##  [1] "First_Name" "Last_Name"  "TEAM"       "AGE"        "GP"        
##  [6] "W"          "L"          "MIN"        "PTS"        "FGM"       
## [11] "FGA"        "FG."        "X3PM"       "X3PA"       "X3P."      
## [16] "FTM"        "FTA"        "FT."        "OREB"       "DREB"      
## [21] "REB"        "AST"        "TOV"        "STL"        "BLK"       
## [26] "PF"         "FP"         "DD2"        "TD3"        "rating"    
## [31] "Year"       "Position"
colnames(team_data)
##  [1] "City"   "Team"   "GP"     "W"      "L"      "WIN."   "MIN"    "PTS"   
##  [9] "FGM"    "FGA"    "FG."    "X3PM"   "X3PA"   "X3P."   "FTM"    "FTA"   
## [17] "FT."    "OREB"   "DREB"   "REB"    "AST"    "TOV"    "STL"    "BLK"   
## [25] "BLKA"   "PF"     "PFD"    "rating" "Year"

Here we check the column names of our player data and team data so we know what parameters we are dealing with.

team_name_mapping <- c(
  "ATL" = "Hawks",
  "BKN" = "Nets",
  "BOS" = "Celtics",
  "CHA" = "Hornets",
  "CHI" = "Bulls",
  "CLE" = "Cavaliers",
  "DAL" = "Mavericks",
  "DEN" = "Nuggets",
  "DET" = "Pistons",
  "GSW" = "Warriors",
  "HOU" = "Rockets",
  "IND" = "Pacers",
  "LAC" = "Clippers",
  "LAL" = "Lakers",
  "MEM" = "Grizzlies",
  "MIA" = "Heat",
  "MIL" = "Bucks",
  "MIN" = "Timberwolves",
  "NOP" = "Pelicans",
  "NYK" = "Knicks",
  "OKC" = "Thunder",
  "ORL" = "Magic",
  "PHI" = "76ers",
  "PHX" = "Suns",
  "POR" = "Trail Blazers",
  "SAC" = "Kings",
  "SAS" = "Spurs",
  "TOR" = "Raptors",
  "UTA" = "Jazz",
  "WAS" = "Wizards"
)

player_data$TEAM <- team_name_mapping[player_data$TEAM]

We map the team abbreviations in the player data to the full team names used in the team data.

library(dplyr)
player_data <- player_data %>%
  mutate(per_game_rating = rating / GP)

team_data_small <- team_data %>%
  select(Team, Year, rating)

names(team_data_small) <- c("TEAM", "Year", "team_rating")

player_data <- left_join(player_data, 
                         team_data_small, 
                         by = c("TEAM" = "TEAM", "Year" = "Year"))

player_data <- player_data %>%
  mutate(player_net_rating = (per_game_rating - team_rating) * GP)

We create a new column called player_net_rating, which calculates each player’s +/- rating relative to his team’s net +/- rating. This ensures that players on better teams don’t get an advantage over players on worse teams.
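A toy check of the arithmetic above with made-up numbers; note that the formula algebraically simplifies to rating - GP * team_rating:

```r
# Made-up example: a player at +60 over 50 games on a team rated +4
rating <- 60
GP <- 50
team_rating <- 4

per_game_rating <- rating / GP
player_net_rating <- (per_game_rating - team_rating) * GP
player_net_rating  # -140, the same as rating - GP * team_rating
```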

# Subset data based on total minutes played condition
player_data <- player_data[player_data$MIN >= 100, ]

This removes players who played fewer than 100 minutes, as these players are often statistical outliers: they have not played enough minutes for the law of large numbers to kick in.

library(ggplot2)

variables <- c("AGE", "PTS", "FG.", "X3P.", "FT.", "OREB", "DREB", "AST", "TOV", "STL", "BLK", "PF", "Position")

for (var in variables) {
  # .data[[var]] is the tidy-evaluation replacement for the deprecated aes_string()
  p = ggplot(player_data, aes(x = .data[[var]], y = rating)) +
    geom_point() + 
    labs(x = var, y = "Player Net Rating") + 
    theme_minimal() + 
    ggtitle(paste("Scatterplot of", var, "vs Player Net Rating")) 
  
  print(p)
}

These scatterplots show the relationship between our parameters and the player net ratings. This will help us determine some of the transformations we should try on our parameters.

full_model = lm(player_net_rating ~ AGE+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+OREB+DREB+AST+STL+BLK+as.factor(Position)+TOV+PF, data=player_data)
new_model = lm(I(player_net_rating^(1/3)) ~ I(1/(AGE^2))+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+DREB+AST+OREB+STL+BLK, data=player_data)
model = new_model
#model = lm(player_net_rating ~ .-First_Name-Last_Name-TEAM-Year-rating-per_game_rating-team_rating-L-FTM-REB-W-FP-DD2-TD3-MIN-FGA-FGM-X3PA-X3PM-FTA-GP, data = player_data)
#summary(model)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(model)
## I(1/(AGE^2))          PTS     I(FG.^2)    I(X3P.^2)     I(FT.^2)         DREB 
##     1.083603     4.699377     1.808656     1.103996     1.426448     5.894531 
##          AST         OREB          STL          BLK 
##     3.584875     4.371550     2.755248     2.544944

We started with a mostly full model that removed the columns that weren’t helpful to our goal of predicting player net rating, using player_net_rating~.-First_Name-Last_Name-TEAM-Year-rating-per_game_rating-team_rating. We also removed categories such as wins, minutes, and games played, which were not helpful from a prediction standpoint. We then ran vif() and found several parameters that overlapped with each other and caused multicollinearity issues, such as total rebounds, which is just offensive and defensive rebounds combined.

We didn’t want team to impact our results; the last two removed columns were used for creating the response, and the others were essentially row identifiers. That left us with the full model above. After looking at the scatterplots, we decided to remove turnovers and personal fouls: players with higher turnovers and personal fouls actually had higher net ratings, because these are the players who play a lot and so have more opportunity to make mistakes or, in some cases, commit valuable fouls. The players who play a lot also tend to be the best players, so we left those two out since the relationship was deceptive. The position category did have some value, but we removed it here because we were more interested in which statistical categories a player contributes to than in which position his team happens to list him at. We can also recover position information indirectly: if, for example, we discover that rebounds are highly predictive of net rating, then we would know that centers provide outsize value, because they are often the bulk rebounders.

The other changes we made were to square the three shooting percentages, because the scatterplots show a nonlinear relationship with net rating. We also transformed age: the youngest players improved the most as they got older, then values reached a near plateau, and the oldest players started to lose value as they aged (at a slower rate than the younger players gained, so we also considered a log relationship). These older players tended to retire before their value presumably would have dropped off even more. Finally, we transformed the response by taking the cube root of the player net rating, because our Q-Q plot and our fitted-versus-residuals plot were wildly off and violated our assumptions.

library(ggplot2)
residuals = resid(model)
predicted = predict(model)

#show the fitted vs residuals
plot(fitted(model), resid(model), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, col = "darkorange", lwd = 2)

#show the Q-Q plot
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey")
qqline(resid(model), col = "dodgerblue", lwd = 2)

cooksd = cooks.distance(model)
plot(cooksd, pch="*", cex=2, main="Influence of Observations (Cook's distance)")
abline(h = 4*mean(cooksd, na.rm=T), col="red")

sum(cooks.distance(model) > 4 / length(cooks.distance(model)))
## [1] 39

Here we test the assumptions of our model, and we see some fairly good results. For linearity, we used our scatterplots to check which parameters did not have linear relationships and transformed the ones that didn’t. Our fitted-versus-residuals plot looks mostly evenly scattered, with maybe a handful of outliers, but much better than our previous attempts, and the vertical spread of the points is roughly constant throughout. So we can feel good about homoscedasticity and independence now. In the Normal Q-Q plot our data adheres very closely to the line, much better than our previous attempts, with very little tail. So we can be fairly confident normality is not violated.
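As a supplement to the visual checks, the usual formal tests could also be run: the Breusch-Pagan test for constant variance (from the lmtest package, assumed installed) and a Shapiro-Wilk test for normality of the residuals. A self-contained sketch on a toy fit (in the report one would pass the fitted model object instead):

```r
library(lmtest)  # assumed installed; provides bptest()

# Toy fit standing in for the model above
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 2 * toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

# Breusch-Pagan: a small p-value suggests heteroscedasticity
bptest(fit)

# Shapiro-Wilk on the residuals: a small p-value suggests non-normality
# (shapiro.test() accepts between 3 and 5000 observations)
shapiro.test(resid(fit))
```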

Let’s see if we can find a good model that uses fewer of our parameters:

library(MASS)
# Use the same transformed response as the full model so AIC/BIC are comparable
null_model <- lm(I(player_net_rating^(1/3)) ~ 1, data = player_data)

scope <- list(lower = null_model, upper = model)

backward_aic <- step(model, direction = "backward", trace = FALSE)
forward_aic <- step(null_model, scope = scope, direction = "forward", trace = FALSE)
both_aic <- step(model, scope = scope, direction = "both", trace = FALSE)

backward_bic <- step(model, k = log(nrow(player_data)), direction = "backward", trace = FALSE)
forward_bic <- step(null_model, scope = scope, k = log(nrow(player_data)), direction = "forward", trace = FALSE)  # forward search starts from the null model
both_bic <- step(model, scope = scope, k = log(nrow(player_data)), direction = "both", trace = 0)

print(backward_aic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(X3P.^2) + OREB + 
##     STL, data = player_data)
## 
## Coefficients:
## (Intercept)    I(X3P.^2)         OREB          STL  
##   4.1898332    0.0001207    0.0018428    0.0042514
print(forward_aic) 
## 
## Call:
## lm(formula = player_net_rating ~ AST + I(1/(AGE^2)) + BLK, data = player_data)
## 
## Coefficients:
##  (Intercept)           AST  I(1/(AGE^2))           BLK  
##    -109.7533        0.2334    42830.8229        0.3568
print(both_aic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(X3P.^2) + OREB + 
##     STL, data = player_data)
## 
## Coefficients:
## (Intercept)    I(X3P.^2)         OREB          STL  
##   4.1898332    0.0001207    0.0018428    0.0042514
print(backward_bic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ STL, data = player_data)
## 
## Coefficients:
## (Intercept)          STL  
##    4.365881     0.005778
print(forward_bic)
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(1/(AGE^2)) + PTS + 
##     I(FG.^2) + I(X3P.^2) + I(FT.^2) + DREB + AST + OREB + STL + 
##     BLK, data = player_data)
## 
## Coefficients:
##  (Intercept)  I(1/(AGE^2))           PTS      I(FG.^2)     I(X3P.^2)  
##    3.991e+00     8.841e+01    -8.087e-05    -9.793e-05     1.170e-04  
##     I(FT.^2)          DREB           AST          OREB           STL  
##    4.422e-05    -4.120e-05    -2.098e-04     3.135e-03     4.703e-03  
##          BLK  
##    7.677e-04
print(both_bic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ STL, data = player_data)
## 
## Coefficients:
## (Intercept)          STL  
##    4.365881     0.005778

We can see that many of the selected models include steals and offensive rebounds, and three-point percentage also appears a couple of times. We will now try including an interaction between steals and offensive rebounds.

new_model = lm(I(player_net_rating^(1/3)) ~ log(AGE)+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+DREB+AST+OREB*STL+BLK, data=player_data)
model = new_model
library(MASS)
# Again match the transformed response of the full model
null_model <- lm(I(player_net_rating^(1/3)) ~ 1, data = player_data)

scope <- list(lower = null_model, upper = model)

backward_aic <- step(model, direction = "backward", trace = FALSE)
forward_aic <- step(null_model, scope = scope, direction = "forward", trace = FALSE)
both_aic <- step(model, scope = scope, direction = "both", trace = FALSE)

backward_bic <- step(model, k = log(nrow(player_data)), direction = "backward", trace = FALSE)
forward_bic <- step(null_model, scope = scope, k = log(nrow(player_data)), direction = "forward", trace = FALSE)  # forward search starts from the null model
both_bic <- step(model, scope = scope, k = log(nrow(player_data)), direction = "both", trace = 0)

print(backward_aic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(FG.^2) + OREB + STL + 
##     OREB:STL, data = player_data)
## 
## Coefficients:
## (Intercept)     I(FG.^2)         OREB          STL     OREB:STL  
##   4.3486634   -0.0001456    0.0090851    0.0107939   -0.0001341
print(forward_aic) 
## 
## Call:
## lm(formula = player_net_rating ~ AST + log(AGE) + BLK, data = player_data)
## 
## Coefficients:
## (Intercept)          AST     log(AGE)          BLK  
##    373.8390       0.2343    -128.0563       0.3555
print(both_aic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ I(FG.^2) + OREB + STL + 
##     OREB:STL, data = player_data)
## 
## Coefficients:
## (Intercept)     I(FG.^2)         OREB          STL     OREB:STL  
##   4.3486634   -0.0001456    0.0090851    0.0107939   -0.0001341
print(backward_bic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ OREB + STL + OREB:STL, 
##     data = player_data)
## 
## Coefficients:
## (Intercept)         OREB          STL     OREB:STL  
##   4.0871716    0.0071683    0.0111633   -0.0001242
print(forward_bic)
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ log(AGE) + PTS + I(FG.^2) + 
##     I(X3P.^2) + I(FT.^2) + DREB + AST + OREB * STL + BLK, data = player_data)
## 
## Coefficients:
## (Intercept)     log(AGE)          PTS     I(FG.^2)    I(X3P.^2)     I(FT.^2)  
##   4.597e+00   -1.820e-01   -1.300e-04   -1.276e-04    9.358e-05    3.982e-05  
##        DREB          AST         OREB          STL          BLK     OREB:STL  
##   1.099e-04   -1.506e-04    9.333e-03    1.145e-02   -3.680e-04   -1.287e-04
print(both_bic) 
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ OREB + STL + OREB:STL, 
##     data = player_data)
## 
## Coefficients:
## (Intercept)         OREB          STL     OREB:STL  
##   4.0871716    0.0071683    0.0111633   -0.0001242

We ran these one more time, now including an interaction between steals and offensive rebounds, and we got some models that include just those three terms. We tried adding and subtracting the various shot-percentage stats, but none made a significant enough impact to include them. Leaving those out, we now check this smaller model against the new model from above.

new_model = lm(I(player_net_rating^(1/3)) ~ log(AGE)+PTS+I(FG.^2)+I(X3P.^2)+I(FT.^2)+DREB+AST+OREB*STL+BLK, data=player_data)
model4= lm(formula = I(player_net_rating^(1/3)) ~ OREB*STL, data = player_data)
summary(model4)
## 
## Call:
## lm(formula = I(player_net_rating^(1/3)) ~ OREB * STL, data = player_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4551 -1.0344  0.0783  0.9905  3.9749 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.087e+00  1.170e-01  34.928  < 2e-16 ***
## OREB         7.168e-03  2.014e-03   3.559 0.000392 ***
## STL          1.116e-02  2.796e-03   3.993 7.05e-05 ***
## OREB:STL    -1.242e-04  3.973e-05  -3.127 0.001823 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.45 on 903 degrees of freedom
##   (974 observations deleted due to missingness)
## Multiple R-squared:  0.02494,    Adjusted R-squared:  0.02171 
## F-statistic:   7.7 on 3 and 903 DF,  p-value: 4.395e-05
anova(model4,new_model)
## Analysis of Variance Table
## 
## Model 1: I(player_net_rating^(1/3)) ~ OREB * STL
## Model 2: I(player_net_rating^(1/3)) ~ log(AGE) + PTS + I(FG.^2) + I(X3P.^2) + 
##     I(FT.^2) + DREB + AST + OREB * STL + BLK
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    903 1898.0                           
## 2    895 1883.7  8     14.25 0.8464 0.5619

We see that with only these three terms we get a large p-value from the ANOVA test, which has us preferring the smaller model. Our individual parameters also all have significant p-values. Our R-squared values aren’t particularly high, but that is the trade-off for the leaner model.
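One way to read the fitted interaction: the marginal effect of an offensive rebound on the cube-root rating is the OREB coefficient plus the interaction coefficient times the steal total. A quick sketch using the coefficients copied from the summary above:

```r
# Coefficients copied from summary(model4) above
b_oreb <- 7.168e-03   # OREB
b_int  <- -1.242e-04  # OREB:STL

# Marginal effect of one additional offensive rebound at a given steal total
oreb_effect <- function(stl) b_oreb + b_int * stl

oreb_effect(26)   # at a modest 26 steals, roughly 0.0039
-b_oreb / b_int   # the effect crosses zero near 57.7 steals
```

So the boost from an extra offensive rebound shrinks as a player’s steal total grows, which is what the negative interaction term encodes.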

hat_values <- hatvalues(model4)
n_used <- length(hat_values)  # observations actually used in the fit (NA rows dropped)
high_leverage_points_indices <- which(hat_values > 2*length(coef(model4))/n_used)  # high leverage points
num_high_leverage_points <- length(high_leverage_points_indices)

cooks_D <- cooks.distance(model4)
influential_outlier_indices <- which(cooks_D > 4/n_used)  # influential points
num_influential_outliers <- length(influential_outlier_indices)
#print(influential_outlier_indices)

studentized_residuals <- rstudent(model4)
outlier_indices <- which(abs(studentized_residuals) > 3)
num_outliers <- length(outlier_indices)


print(paste("Number of outliers:", num_outliers))
## [1] "Number of outliers: 2"
print(paste("Number of influential outliers:", num_influential_outliers))
## [1] "Number of influential outliers: 98"
print(paste("Number of high leverage points:", num_high_leverage_points))
## [1] "Number of high leverage points: 50"

Here we check our influential points, outliers, and high-leverage points. This is how we knew it made sense to remove players with fewer than 100 minutes played: they were outliers in the dataset. For example, some players had a 0% field goal percentage, which makes sense because a player who doesn’t take many shots might miss the one or two he does take, and that isn’t a representative sample of his abilities.

Result

model = model4
library(ggplot2)
residuals = resid(model)
predicted = predict(model)

#show the fitted vs residuals
plot(fitted(model), resid(model), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
abline(h = 0, col = "darkorange", lwd = 2)

#show the Q-Q plot
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey")
qqline(resid(model), col = "dodgerblue", lwd = 2)

cooksd = cooks.distance(model)
plot(cooksd, pch="*", cex=2, main="Influence of Observations (Cook's distance)")
abline(h = 4*mean(cooksd, na.rm=T), col="red")

sum(cooks.distance(model) > 4 / length(cooks.distance(model)))
## [1] 36

We now check the assumptions of our smaller model, and they hold even better than in our previous checks.

Discussion

We end up with a model indicating that steals and offensive rebounds are the best predictors of how a player contributes to his team’s success via net rating. This makes quite a lot of sense: the two stats share a major commonality, in that both give the team an extra possession. An extra possession means a full opportunity to score again at any possible point value. Stats like the shot percentages are useful, but not as much. Think about it from an expected-value perspective: say the league-average field goal percentage is 50%. If a player gives his team an extra possession and they take a two-point shot, they gain 1 expected point. Compare that to no extra possession but a field goal percentage boosted from 50% to a DeAndre Jordan-like 76.3% (his 2020-21 mark): on a two-point shot that is worth an additional 0.526 points, less than the 1 point gained from the steal or offensive rebound.
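The expected-value arithmetic above, spelled out (the league-average 50% is an assumption for illustration; 76.3% is the DeAndre Jordan 2020-21 figure cited above):

```r
fg_avg   <- 0.50   # assumed league-average field goal percentage
fg_elite <- 0.763  # DeAndre Jordan's 2020-21 field goal percentage

# Expected points from one extra possession used on an average two-point shot
2 * fg_avg               # 1.0

# Expected points from upgrading one two-point attempt to elite efficiency
2 * (fg_elite - fg_avg)  # 0.526
```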

Should NBA teams only go after players with high steal and offensive rebound totals? There are a lot of other factors to consider in team building, and improving shot percentage, for example, still matters (especially at the margins in close games), just perhaps less than these two categories do in the long run. A better takeaway is that there is significant value to be gained from picking players who are better at steals and offensive rebounds than otherwise similar players at their position.

Appendix

Data from:

https://www.nba.com/stats/players/traditional?PerMode=Totals&Season=2020-21&SeasonType=Regular%20Season&dir=A&sort=FG_PCT
https://www.nba.com/stats/teams/traditional?SeasonType=Regular+Season&Season=2019-20
https://www.basketball-reference.com/
https://www.basketball-reference.com/leagues/NBA_2023_totals.html